Cergentis Positions of Interest: WG1 Mixture Study

2021-07-14

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(here)
library(plotly)
library(scales)
library(kableExtra)
library(scales)

Workflow

  1. Gather all required input files (vcf and beds)
  2. Create bed files of positions of interest by taking chromosome, start, and end positions, then creating a new column where those are concatenated.
  3. Use bedtools slop to extend the start and end columns by however many bases you need (value is 1/2 total pad value)
  4. Use coveragebed (wrapper in snakemake) to calculate bases covered at positions
  5. Use bcftools annotate to annotate the vcfs with the positions of interest information
  6. Use bcftools query to pull out rows with positions of interest marks
  7. Analyze and visualize results in R.

Diagram of Snakemake rules and workflow

via

snakemake --dag | dot -Tsvg > dag.svg

Data File Collection

STEP 1

Collected necessary files. Per Nate I had to retrieve the following sets of files:

  • v4.2.1 HG002
  • v4.2.1 HG003
  • v0.6 HG002
  • v1.0.0 CMRG (small variants and structural variants) See the resources/README for details.

STEP 2

Generated bed file of positions of interest. From the “Larger Variants list to explore for Cergentis_06.21.21” excel file provided by Samantha Maragh, I took the first four columns (Chr, hg19 start, hg19 end, hg38 start, hg38 end) and moved it into a new file “poi.tsv”. Samantha added 3 new positions based on the Genomic Vision list: - 2 49304890 49316235 49532029 49543374 - 6 76727407 76749167 77437124 77458884 - 7 143127742 143196806 142824835 142893899

This resulted in a total of 22 positions of interest. For any insertions - there are no end coordiantes. Per Nate, just take the start and add +1 base. This was done for each genome, the headers removed, and placed into the two files: hg19_poi.bed & hg38_poi.bed

NOTE for deletions where there was no end coordinate, I made the end coordinate +1 start coordinate. I also added a column to each bed file where I concatenated the chrom_start_end as unique variant IDs (will be column VOI later)

These are all in the data/ directory. See data/README for details.

STEP 3

Nate & I determined I would try to do the work via the command line first, manually, then adapt it to Snakemake. This work is contained in the Commands section of this document.

Commands

Determining which flags need to be used, resolving any issues locally before porting over the process to Snakemake. Starting with GRCh37 50 kb to test.

Bedtools slop

Snakemake will use this wrapper: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/bedtools/slop.html

The code in the wrapper:

__author__ = "Jan Forster"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"

from snakemake.shell import shell

## Extract arguments
extra = snakemake.params.get("extra", "")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "(bedtools slop"
    " {extra}"
    " -i {snakemake.input[0]}"
    " -g {snakemake.params.genome}"
    " > {snakemake.output})"
    " {log}"
)

So command line here would be:

Add 50kb

bedtools slop -b 25000 -i cergentis_variant_surveillance-master/data/hg19_poi.bed -g cergentis_variant_surveillance-master/data/hg19_genome.txt > hg19_poi_50kb_slop.bed

-b 25000 adds 25kb on either side of the position of interest, totalling 50 kb padding.

Bedtools coverage

Will use snakemake wrapper: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/bedtools/coveragebed.html Wrapper code:

__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"


from snakemake.shell import shell

shell.executable("bash")

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

extra_params = snakemake.params.get("extra", "")

input_a = snakemake.input.a
input_b = snakemake.input.b

output_file = snakemake.output[0]

if not isinstance(output_file, str) and len(snakemake.output) != 1:
    raise ValueError("Output should be one file: " + str(output_file) + "!")

shell(
    "coverageBed"
    " -a {input_a}"
    " -b {input_b}"
    " {extra_params}"
    " > {output_file}"
    " {log}"
)

Requires bam but Nate’s MRG snakefile uses a bed file:

rule calc_gene_coverage:
    input:
        a=ensembl_dir + "/{ref}_mrg_full_{region}.bed",
        b=benchdir + "/HG002_{ref}_{benchmarkset}_{benchtype}.bed"
    output: "workflow/results/cov_tbls/HG002_{ref}_{benchmarkset}_{benchtype}_{region}_cov.tsv"
    threads: 2
    wrapper: "0.74.0/bio/bedtools/coveragebed"

50kb - GRCh37 benchmark bed

coverageBed -a /Users/sdm8/Desktop/test_output/hg19_poi_50kb_slop.bed -b /Users/sdm8/Desktop/cergentis_variant_surveillance-master/resources/HG002_GRCh37_1_22_v4.2.1_benchmark_noinconsistent.bed > ~/Desktop/test_output/hg19_50kb_GRCh37benchmark_coverage.bed

bcftools filter

Filter the vcf file according to the regions defined in the positions of interest bed file.

bcftools filter /Users/sdm8/Desktop/cergentis_variant_surveillance-master/resources/HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz -R /Users/sdm8/Desktop/test_output/hg19_poi_50kb_slop.bed > ~/Desktop/test_output/HG002_GRCh37_1_22_v4.2.1_benchmark_filtered.vcf

bcftools annotate

Make a header text file to define what the header text will be.

echo '##INFO=<ID=VOI,Number=1,Type=String,Description="Variant of Interest ID">' > ~/Desktop/header.txt 

Annotate the sorted bed file.

bcftools annotate -a /Users/sdm8/Desktop/test_output/hg19_poi_50kb_slop.bed -c 'CHROM,FROM,TO,INFO/VOI' -h ~/Desktop/test_output/header.txt ~/Desktop/test_output/HG002_GRCh37_1_22_v4.2.1_benchmark_filtered.vcf > ~/Desktop/test_output/HG002_GRCh37_1_22_v4.2.1_benchmark_filtered_annotated.vcf 

bcftools query

bcftools query -f '%CHROM\t%POS\t[ %GT]\t%TYPE\t%INFO/VOI\t%REF\t%ALT\n' ~/Desktop/test_output/HG002_GRCh37_1_22_v4.2.1_benchmark_filtered_annotated.vcf > ~/Desktop/HG002_GRCh37_1_22_v4.2.1_benchmark_filtered_annotated.tsv

Run snakemake:

conda activate snakemake-tutorial 
snakemake --cores 1 --use-conda

(Had to remove the filter step - could not get it to work. will instead filter in R)

Query coverage bed file cols:

n_regions is the number of regions in the bed file overlapping the gene. For your work, this is the number of region (rows of the benchmark regions bed file) that overlap with the window around a variant of interest).

bp_cov is the number of bases that are covered by the benchmark regions. In your case this is the number of bases in the 50 or 100 kb window around a variant of interest that is included in the benchmark regions.

cov is the fraction of bases covered

## read in snakemake file and display
cat(readLines('~/Desktop/cergentis_variant_surveillance/workflow/snakefile'), sep = '\n')
## from snakemake.utils import min_version
## 
## ## set minimum snakemake version
## min_version("6.4.0")
## 
## ###########################
## ## DEFINE WILDCARDS
## ###########################
## 
## SLOPS = ["50000"]
## BENCHSETS = ["v4_smallvar", "v0.6.2_SVs-Tier1-noVDJorXorY"]
## GIABIDS = ["HG002", "HG003"]
## 
## ## Add wildcard constraints
## wildcard_constraints:
##     slop="|".join(SLOPS),
##     giab_id="|".join(GIABIDS),
##     benchset="|".join(BENCHSETS)
## 
## ###########################
## ## RULES
## ###########################
## 
## ## define all important and final outputs 
## ## tells snakemake to do all rules that make these files
## rule all:
##     input: 
##         expand("results/query/{giab_id}_GRCh37_v4_smallvar_{slop}_anno.tsv", 
##                 giab_id = GIABIDS, slop = SLOPS),
##         expand("results/query/HG002_GRCh37_v0.6.2_SVs-Tier1-noVDJorXorY_{slop}_anno.tsv", 
##                 slop = SLOPS),   
##         expand("results/coveragebed/poi_{slop}_HG002_GRCh37_v0.6.2_SVs-Tier1-noVDJorXorY_cov.bed", 
##                 slop = SLOPS),
##         expand("results/coveragebed/poi_{slop}_{giab_id}_GRCh37_v4_smallvar_cov.bed", 
##                 giab_id = GIABIDS, slop = SLOPS)
## 
## 
## ## Prepare and Rename input benchmarkset files
## rule prepare_v4:
##     input:
##         v4_smallvar_vcf = "resources/{giab_id}_GRCh37_1_22_v4.2.1_benchmark.vcf.gz",
##         v4_smallvar_bed = "resources/{giab_id}_GRCh37_1_22_v4.2.1_benchmark_noinconsistent.bed"
##     output:
##         v4_smallvar_vcf = "workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.vcf.gz",
##         v4_smallvar_bed = "workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.bed"
##     shell: """
##         cp {input.v4_smallvar_vcf} {output.v4_smallvar_vcf}
##         cp {input.v4_smallvar_bed} {output.v4_smallvar_bed}
##         """
## 
## rule prepare_sv:
##     input:
##         sv_tier1_vcf = "resources/{giab_id}_v0.6.2_SVs-Tier1-noVDJorXorY.vcf.gz",
##         sv_tier1_bed = "resources/{giab_id}_v0.6.2_SVs-Tier1-noVDJorXorY.bed"
##     output:
##         sv_tier1_vcf = "workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.vcf.gz",
##         sv_tier1_bed = "workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.bed"
##     shell: """
##         cp {input.vcf} {output.vcf}
##         cp {input.bed} {output.bed}
##         """
## 
## 
## ## BEDTOOLS SORT
## ## sort manually generated poi bed
## rule bedtools_sort:
##     input:
##         in_file="data/GRCh37_poi.bed",
##         genome="data/GRCh37_genome.txt"
##     output:"results/sorted/GRCh37_poi_sorted.bed"
##     wrapper: "0.77.0/bio/bedtools/sort"
## 
## 
## ##  BEDTOOLS SLOP
## ## add 50 and 100 kb surrounding positions of interest
## rule bedtools_pad:
##     input: "results/sorted/GRCh37_poi_sorted.bed"
##     output: "results/slop/GRCh37_poi_{slop}.bed"
##     params:
##         genome="data/GRCh37_genome.txt",
##         ## Add optional parameters
##         extra = "-b {slop}" 
##     log: "logs/slop/GRCh37_poi_{slop}.bed.log"
##     wrapper: "v0.75.0/bio/bedtools/slop"
## 
## ## BEDTOOLS COVERAGE
## ## Calulate number of included bases for each benchmark set in the 50 and 100 kb windows
## rule coverageBed:
##     input:
##         a="results/slop/GRCh37_poi_{slop}.bed",
##         b="workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.bed"
##     output: "results/coveragebed/poi_{slop}_{giab_id}_GRCh37_{benchset}_cov.bed"
##     log: "logs/coveragebed/poi_{slop}_{giab_id}_GRCh37_{benchset}_cov.log"
##     threads: 6
##     wrapper: "0.76.0/bio/bedtools/coveragebed"
## 
## 
## ## BCFTOOLS ANNOTATE
## ## make header file for annotate_vcf rule
## rule make_hdr_file:
##     output: "workflow/data/header.txt"
##     shell: """
##         echo '##INFO=<ID=VOI,Number=1,Type=String,Description="Variant of Interest ID">' > {output}
##     """
## 
## 
## rule annotate_vcf:
##     input:
##         bed="results/slop/GRCh37_poi_{slop}.bed",
##         header="workflow/data/header.txt",
##         vcf="workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.vcf.gz"
##     output:"results/annotate/{giab_id}_GRCh37_{benchset}_{slop}_anno.vcf"
##     conda: "envs/bcftools.yml"
##     shell: """
##     bcftools annotate \
##         -a  {input.bed} \
##         -c CHROM,FROM,TO,INFO/VOI \
##         -h {input.header} \
##         {input.vcf} \
##         > {output}
##     """
## 
## 
## ## BCFTOOLS QUERY
## ## Make variant table
## rule make_smallvar_tbl:
##     input:"results/annotate/{giab_id}_GRCh37_{benchset}_{slop}_anno.vcf"
##     output:"results/query/{giab_id}_GRCh37_{benchset}_{slop}_anno.tsv"
##     conda:"envs/bcftools.yml"
##     shell: """
##         bcftools query \
##             -f '%CHROM\\t%POS\\t[ %GT]\\t%TYPE\\t%INFO/VOI\\t%REF\\t%ALT\\n' \
##             {input} \
##             > {output}
##     """

Analysis

Run Notes

We used only GRCh37 so we could include the SV input files (only performed on GRCh37). Final positions will be converted to GRCh38.

We used only the small variant and SV (excluding the CMRG) as input.

All end posiitons of the GRCh38 poi bed file was start+1 to maintain a constant 50kb window (the extra 1 base is regarded as okay).

Objective:

Catalog all variants (both small and structural variants) within 50 kb of structural variants previously identified for the GEC mixture study.

Approach:

  1. Make a summary df with all of the result tsv files (add col for ID)
  2. Filter based on INFO/VOI tag in INFO columns (those without = outside of poi range)
  3. Summarise: At each poi: the number of variants, size distribution, type of variants

ID the poi’s with the widest range of variants (#, type diversity, size range)

Positions of Interest

GRCh37

The end positions were all start+1 so as to keep the padding in a 50kb window at all times (excluding those positions whose variants were large extending the end position nearing or exceeding 50kb).

(grch37_poi_bed <- read.table(here("data/GRCh37_poi.bed"),header = FALSE, sep="\t",stringsAsFactors=FALSE, quote="") %>%
  rename(chromosome = V1, start_coord = V2, end_coord = V3, chrom_start_end = V4)) 
##    chromosome start_coord end_coord        chrom_start_end
## 1           2   206241116 206241117  2_206241116_206241117
## 2           2    49561482  49561483    2_49561482_49561483
## 3           3   127859142 127859143  3_127859142_127859143
## 4           3   162512133 162512134  3_162512133_162512134
## 5           4   103764056 103764057  4_103764056_103764057
## 6           5   117868224 117868225  5_117868224_117868225
## 7           6   156081224 156081225  6_156081224_156081225
## 8           7    16171773  16171774    7_16171773_16171774
## 9           9    10163173  10163174    9_10163173_10163174
## 10         10    66299841  66299842   10_66299841_66299842
## 11         10    81695836  81695837   10_81695836_81695837
## 12         10    82524373  82524374   10_82524373_82524374
## 13         12    53499604  53499605   12_53499604_53499605
## 14         12    38824344  38824345   12_38824344_38824345
## 15         13    90154843  90154844   13_90154843_90154844
## 16         13    42223740  42223741   13_42223740_42223741
## 17         14   105149647 105149648 14_105149647_105149648
## 18         15    79844900  79844901   15_79844900_79844901
## 19         16    82791081  82791082   16_82791081_82791082
## 20          2    49532029  49532030    2_49532029_49532030
## 21          6    77437124  77437125    6_77437124_77437125
## 22          7   142824835 142824836  7_142824835_142824836

Load & Munge Data

Read in the Snakemake results files and organize into annotated data frames.

##################### QUERY RESULT FILES
poi_defns <- read_tsv(file=here("data/GRCh37_POI_definitions.tsv"), col_names=FALSE) %>%
  rename(chromosome = X1, start = X2, end = X3, VOI = X4, initial_var_type = X5, initial_var_size = X6) %>%
  select(-chromosome, -start, -end)


## get list of files in results directory
file_list <- as.list(list.files(path = here("results/query"), 
                           pattern = "anno.tsv",
                           include.dirs = TRUE, full.names = TRUE, 
                           recursive = TRUE))

## strip the path portion of the file names
file_names <- str_remove(file_list, paste0(here("results/query"),"/"))

## organize the file list to read them into a df
df_lst <- set_names(file_list, file_names)

## make that into a data frame
metadata_df <- tibble(query_file = unlist(file_names)) %>% 
  separate(query_file, c("giab_id","ref_genome", "benchset", "bench-type", "pad_size", "temp"), 
           sep = "_", remove = FALSE) %>% 
  select(-temp)

## read in files and create 1 df
cergentis_df <- df_lst %>%
  map_dfr(read_tsv, .id = "query_file", col_names = c("chrom", "pos", "GT", "var_type", "VOI", "ref", "alt")) %>%
  ## filter out regions outside of poi (no VOI entry)
  filter(!VOI == ".")  

## Annotating with sample metadata
cergentis_anno_df <- cergentis_df %>% 
  left_join(metadata_df) %>%
  left_join(poi_defns, by = "VOI")

##################### COVERAGE RESULT FILES

## get files in a list to read in
cov_files <- list.files(path = here("results/coveragebed/"),
                             pattern = ".bed") %>%
  set_names(~ str_remove(., "_cov.bed")) %>%
  set_names(~ str_remove(., "poi_")) %>%
  map(~ here("results/coveragebed/", .))

## build df
cov_df <- cov_files %>%
  ## read in files
  map_dfr(
    read_tsv,
    col_names = c(
      "chrom",
      "start",
      "end",
      "chrom_start_end",
      "n_regions",
      "bp_cov",
      "gene_size",
      "cov"),
    col_types = "ciiciiid",
    .id = "covset"
  ) %>%
  ## separate the file name into columns with metadata
  separate(
    col = covset,
    into = c("giab_id", "ref", "benchmark", "bench_type", "region"),
    sep = "_",
    remove = TRUE
  )

HG002

Visualize & Summarize Variants

Prepare Data

Filter data and organize into dataframes for visualization.

# filter to include only HG002
HG002_cergentis_anno_df <- cergentis_anno_df %>%
  filter(giab_id == "HG002")

################# CALCULATE VARIANT SIZE

## Calculating INDEL Variant Size
get_var_size <- function(ref, alt){
  ref_bp <- nchar(ref)
  
  ## Dealing with multiple ALTs
  # --- using largest SV
  if(str_detect(alt, ",")){
    alts <- unlist(str_split(alt, ","))
    alts_bp <- nchar(alts)
    var_sizes <- alts_bp - ref_bp
    
    var_size <- var_sizes[abs(var_sizes) == max(abs(var_sizes))]
    
    ## returning first entry value if the two genotypes are the same size
    return(var_size[1])
  }
  return(nchar(alt) - ref_bp)
}


## Getting variant size into df
HG002_sized_var_df <- HG002_cergentis_anno_df %>%
  rowwise() %>%
  mutate(var_size = get_var_size(ref, alt))

################## COUNTS VARIANTS PER POI

# count how many variants are within each poi
HG002_num_vars <- HG002_sized_var_df %>%
  group_by(VOI, pad_size) %>%
  tally() %>%
  rename(num_vars = n)

# count how many of each type of variant is within each poi
HG002_type_vars <- HG002_sized_var_df %>%
  group_by(VOI, var_type, pad_size) %>%
  tally() %>%
  pivot_wider(names_from = var_type, values_from = n, values_fill = 0) %>%
  rename(num_SNP = SNP, num_INDEL = INDEL, num_SNP_INDEL = 'SNP,INDEL')

# df where variant type counts are summarized
HG002_var_counts_summary_df <- HG002_num_vars %>%
  left_join(HG002_type_vars, by = "VOI") %>%
  select(-pad_size.x, -pad_size.y)

# the sum of counts of types of vars should = num of vars total
HG002_sanity_check <- HG002_var_counts_summary_df %>%
  rowwise() %>%
  mutate(total = sum(num_SNP,num_INDEL,num_SNP_INDEL, na.rm = T)) %>%
  mutate(check = (num_vars == total)) %>%
  filter(check != TRUE) %>%
  nrow(.) == 0

# add the var count and types into the var size df
HG002_variant_df <- HG002_sized_var_df %>%
  left_join(HG002_num_vars, by = "VOI") %>%
  left_join(HG002_type_vars, by = "VOI")%>%
  mutate(coord = paste(chrom, pos, sep="_"))

write.table(HG002_variant_df, file = here("results/analysis/HG002_variant_df.tsv"), sep = "\t", row.names = FALSE)

Variant Type per POI

Per position of interest (POI) how many of each type of variant exists across the reference benchmarks within 50 kb on either side?

HG002_var_counts_summary_plot <- HG002_var_counts_summary_df %>%
  select(-num_vars) %>%
  rename(SNP = num_SNP, INDEL = num_INDEL, SNP_INDEL = num_SNP_INDEL) %>%
  pivot_longer(names_to = "var_type", values_to = "count", 
               cols = c("SNP", "INDEL", "SNP_INDEL"))

# color palette definition
cbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")  

# bar plot of counts per variant type
ggplotly(ggplot(HG002_var_counts_summary_plot) + 
    geom_bar(aes(x = VOI, y = count, fill = var_type), 
             position = position_dodge(preserve = "single"), stat = "identity") +
       scale_y_continuous(labels = comma) +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(x = "Variant Position of Interest (chrom_start_stop)", y = "Total Count", 
         fill = "Variant Type") + scale_fill_manual(values=cbPalette))
# table view of plot
(HG002_var_counts_summary_df %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))
VOI num_vars num_INDEL num_SNP num_SNP_INDEL
10_66299841_66299842 182 36 146 0
10_81695836_81695837 398 46 352 0
10_82524373_82524374 440 34 406 0
12_38824344_38824345 222 36 184 2
12_53499604_53499605 348 68 280 0
13_42223740_42223741 322 40 282 0
13_90154843_90154844 392 92 300 0
14_105149647_105149648 208 24 184 0
15_79844900_79844901 596 78 518 0
16_82791081_82791082 592 52 540 0
2_206241116_206241117 480 62 418 0
2_49532029_49532030 390 62 328 0
2_49561482_49561483 26 8 18 0
3_127859142_127859143 210 48 162 0
3_162512133_162512134 140 30 110 0
4_103764056_103764057 354 64 290 0
5_117868224_117868225 482 54 428 0
6_156081224_156081225 188 28 160 0
6_77437124_77437125 266 28 238 0
7_142824835_142824836 2 2 0 0
7_16171773_16171774 318 34 284 0
9_10163173_10163174 326 42 284 0

Variant Size per POI

Per position of interest (POI) what are the size ranges of variants that exist across the reference benchmarks within 50 kb on either side?

SNPs

########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG002_snp_variant_plot_df <- HG002_variant_df %>%
  filter(var_type == "SNP") %>%
  select(VOI, var_type, var_size) %>%
  group_by(VOI) %>%
  summarise(count = n())  ## count how many of each size per poi


########## PLOT 
## organize all poi plots in a gallery view 
ggplotly(ggplot(HG002_snp_variant_plot_df) +
    geom_bar(aes(x = VOI, y = count), 
              stat = "identity") +
        scale_y_continuous(labels = comma) +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(x = "Variant Variant Size", y = "Count",
         fill = "Variant Type", title = "Number of SNP Variants per POI") + scale_fill_manual(values=cbPalette))
#(snp_variant_plot_df %>%
#  kbl() %>%
#  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))

INDEL

########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG002_indel_variant_plot_df <- HG002_variant_df %>%
  filter(var_type == "INDEL") %>%
  select(VOI, var_type, var_size) %>%
  group_by(VOI, var_size) %>%
  summarise(count = n())  ## count how many of each size per poi


############## SEPARATE DFS
## too many VOIs to render plot nicely - split up and render separately 
VOI_lst <- c("2_206241116_206241117", "2_49532029_49532030","2_49561482_49561483",
             "3_127859142_127859143","3_162512133_162512134")
HG002_indel_voi_chrom_23 <- HG002_indel_variant_plot_df %>%
  filter((VOI %in% VOI_lst))

VOI_lst <- c("4_103764056_103764057", "5_117868224_117868225",
             "6_156081224_156081225","6_77437124_77437125")
HG002_indel_voi_chrom_456 <- HG002_indel_variant_plot_df %>%
  filter((VOI %in% VOI_lst))

VOI_lst <- c("7_142824835_142824836", "7_16171773_16171774","9_10163173_10163174",
             "10_66299841_66299842","10_81695836_81695837", "10_82524373_82524374")
HG002_indel_voi_chrom_7910 <- HG002_indel_variant_plot_df %>%
  filter((VOI %in% VOI_lst))

VOI_lst <- c("12_38824344_38824345", "12_53499604_53499605",
             "13_42223740_42223741","13_90154843_90154844")
HG002_indel_voi_chrom_1213 <- HG002_indel_variant_plot_df %>%
  filter((VOI %in% VOI_lst))

VOI_lst <- c("14_105149647_105149648", "15_79844900_79844901","16_82791081_82791082")
HG002_indel_voi_chrom_141516 <- HG002_indel_variant_plot_df %>%
  filter((VOI %in% VOI_lst))

########## PLOT 
## organize all poi plots in a gallery view 
ggplotly(ggplot(HG002_indel_voi_chrom_23) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 2 & 3") + scale_fill_manual(values=cbPalette))
ggplotly(ggplot(HG002_indel_voi_chrom_456) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 4, 5, & 6") + scale_fill_manual(values=cbPalette))
ggplotly(ggplot(HG002_indel_voi_chrom_7910) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 7, 0, & 10") + scale_fill_manual(values=cbPalette))
ggplotly(ggplot(HG002_indel_voi_chrom_1213) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 12 & 13") + scale_fill_manual(values=cbPalette))
ggplotly(ggplot(HG002_indel_voi_chrom_141516) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 14, 15, & 16") + scale_fill_manual(values=cbPalette))
########## TABLE
#(indel_variant_plot_df %>%
#  kbl() %>%
#  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))

SNP or INDEL

########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG002_snp_indel_variant_plot_df <- HG002_variant_df %>%
  filter(var_type == "SNP,INDEL") %>%
  select(VOI, var_type, var_size) %>%
  group_by(VOI, var_size) %>%
  summarise(count = n())  ## count how many of each size per poi
## `summarise()` has grouped output by 'VOI'. You can override using the `.groups` argument.
########## PLOT 
## organize all poi plots in a gallery view 
ggplotly(ggplot(HG002_snp_indel_variant_plot_df) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count") + scale_fill_manual(values=cbPalette))
(HG002_snp_indel_variant_plot_df %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))
VOI var_size count
12_38824344_38824345 6 2

HG003

Visualize & Summarize Variants

Prepare Data

Filter data and organize into dataframes for visualization.

# filter to include only HG003
HG003_cergentis_anno_df <- cergentis_anno_df %>%
  filter(giab_id == "HG003")

################# CALCULATE VARIANT SIZE

## Calculating INDEL Variant Size
get_var_size <- function(ref, alt){
  ref_bp <- nchar(ref)
  
  ## Dealing with multiple ALTs
  # --- using largest SV
  if(str_detect(alt, ",")){
    alts <- unlist(str_split(alt, ","))
    alts_bp <- nchar(alts)
    var_sizes <- alts_bp - ref_bp
    
    var_size <- var_sizes[abs(var_sizes) == max(abs(var_sizes))]
    
    ## returning first entry value if the two genotypes are the same size
    return(var_size[1])
  }
  return(nchar(alt) - ref_bp)
}


## Getting variant size into df
HG003_sized_var_df <- HG003_cergentis_anno_df %>%
  rowwise() %>%
  mutate(var_size = get_var_size(ref, alt))

################## COUNTS VARIANTS PER POI

# count how many variants are within each poi
HG003_num_vars <- HG003_sized_var_df %>%
  group_by(VOI, pad_size) %>%
  tally() %>%
  rename(num_vars = n)

# count how many of each type of variant is within each poi
HG003_type_vars <- HG003_sized_var_df %>%
  group_by(VOI, var_type, pad_size) %>%
  tally() %>%
  pivot_wider(names_from = var_type, values_from = n, values_fill = 0) %>%
  rename(num_SNP = SNP, num_INDEL = INDEL, num_SNP_INDEL = 'SNP,INDEL')

# df where variant type counts are summarized
HG003_var_counts_summary_df <- HG003_num_vars %>%
  left_join(HG003_type_vars, by = "VOI") %>%
  select(-pad_size.x, -pad_size.y)

# the sum of counts of types of vars should = num of vars total
HG003_sanity_check <- HG003_var_counts_summary_df %>%
  rowwise() %>%
  mutate(total = sum(num_SNP,num_INDEL,num_SNP_INDEL, na.rm = T)) %>%
  mutate(check = (num_vars == total)) %>%
  filter(check != TRUE) %>%
  nrow(.) == 0

# add the var count and types into the var size df
HG003_variant_df <- HG003_sized_var_df %>%
  left_join(HG003_num_vars, by = "VOI") %>%
  left_join(HG003_type_vars, by = "VOI")%>%
  mutate(coord = paste(chrom, pos, sep="_"))

write.table(HG003_variant_df, file = here("results/analysis/HG003_variant_df.tsv"), sep = "\t", row.names = FALSE)

Variant Type per POI

Per position of interest (POI) how many of each type of variant exists across the reference benchmarks within 50 kb on either side?

HG003_var_counts_summary_plot <- HG003_var_counts_summary_df %>%
  select(-num_vars) %>%
  rename(SNP = num_SNP, INDEL = num_INDEL, SNP_INDEL = num_SNP_INDEL) %>%
  pivot_longer(names_to = "var_type", values_to = "count", 
               cols = c("SNP", "INDEL", "SNP_INDEL"))

# color palette definition
cbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")  

# bar plot of counts per variant type
ggplotly(ggplot(HG003_var_counts_summary_plot) + 
    geom_bar(aes(x = VOI, y = count, fill = var_type), 
             position = position_dodge(preserve = "single"), stat = "identity") +
       scale_y_continuous(labels = comma) +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(x = "Variant Position of Interest (chrom_start_stop)", y = "Total Count", 
         fill = "Variant Type") + scale_fill_manual(values=cbPalette))
# table view of plot
(HG003_var_counts_summary_df %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))
VOI num_vars num_INDEL num_SNP num_SNP_INDEL
10_66299841_66299842 75 15 60 0
10_81695836_81695837 8 4 4 0
10_82524373_82524374 60 5 55 0
12_38824344_38824345 158 27 130 1
12_53499604_53499605 28 6 22 0
13_42223740_42223741 52 6 46 0
13_90154843_90154844 205 41 164 0
14_105149647_105149648 103 17 86 0
15_79844900_79844901 136 15 120 1
16_82791081_82791082 246 23 223 0
2_206241116_206241117 281 32 249 0
2_49532029_49532030 149 25 124 0
2_49561482_49561483 41 8 33 0
3_127859142_127859143 35 9 26 0
3_162512133_162512134 60 13 47 0
4_103764056_103764057 9 2 7 0
5_117868224_117868225 156 17 139 0
6_156081224_156081225 110 20 90 0
6_77437124_77437125 121 9 112 0
7_142824835_142824836 1 1 0 0
7_16171773_16171774 111 14 97 0
9_10163173_10163174 266 39 227 0

Variant Size per POI

Per position of interest (POI) what are the size ranges of variants that exist across the reference benchmarks within 50 kb on either side?

SNPs

########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG003_snp_variant_plot_df <- HG003_variant_df %>%
  filter(var_type == "SNP") %>%
  select(VOI, var_type, var_size) %>%
  group_by(VOI) %>%
  summarise(count = n())  ## count how many of each size per poi


########## PLOT 
## organize all poi plots in a gallery view 
ggplotly(ggplot(HG003_snp_variant_plot_df) +
    geom_bar(aes(x = VOI, y = count), 
              stat = "identity") +
        scale_y_continuous(labels = comma) +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 90)) + 
    labs(x = "Variant Variant Size", y = "Count",
         fill = "Variant Type", title = "Number of SNP Variants per POI") + scale_fill_manual(values=cbPalette))
#(snp_variant_plot_df %>%
#  kbl() %>%
#  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))

INDEL

########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG003_indel_variant_plot_df <- HG003_variant_df %>%
  filter(var_type == "INDEL") %>%
  select(VOI, var_type, var_size) %>%
  group_by(VOI, var_size) %>%
  summarise(count = n())  ## count how many of each size per poi


############## SEPARATE DFS
## too many VOIs to render plot nicely - split up and render separately 
VOI_lst <- c("2_206241116_206241117", "2_49532029_49532030","2_49561482_49561483",
             "3_127859142_127859143","3_162512133_162512134")
HG003_indel_voi_chrom_23 <- HG003_indel_variant_plot_df %>%
  filter((VOI %in% VOI_lst))

VOI_lst <- c("4_103764056_103764057", "5_117868224_117868225",
             "6_156081224_156081225","6_77437124_77437125")
HG003_indel_voi_chrom_456 <- HG003_indel_variant_plot_df %>%
  filter((VOI %in% VOI_lst))

VOI_lst <- c("7_142824835_142824836", "7_16171773_16171774","9_10163173_10163174",
             "10_66299841_66299842","10_81695836_81695837", "10_82524373_82524374")
HG003_indel_voi_chrom_7910 <- HG003_indel_variant_plot_df %>%
  filter((VOI %in% VOI_lst))

VOI_lst <- c("12_38824344_38824345", "12_53499604_53499605",
             "13_42223740_42223741","13_90154843_90154844")
HG003_indel_voi_chrom_1213 <- HG003_indel_variant_plot_df %>%
  filter((VOI %in% VOI_lst))

VOI_lst <- c("14_105149647_105149648", "15_79844900_79844901","16_82791081_82791082")
HG003_indel_voi_chrom_141516 <- HG003_indel_variant_plot_df %>%
  filter((VOI %in% VOI_lst))

########## PLOT 
## organize all poi plots in a gallery view 
ggplotly(ggplot(HG003_indel_voi_chrom_23) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 2 & 3") + scale_fill_manual(values=cbPalette))
ggplotly(ggplot(HG003_indel_voi_chrom_456) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 4, 5, & 6") + scale_fill_manual(values=cbPalette))
ggplotly(ggplot(HG003_indel_voi_chrom_7910) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 7, 0, & 10") + scale_fill_manual(values=cbPalette))
ggplotly(ggplot(HG003_indel_voi_chrom_1213) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 12 & 13") + scale_fill_manual(values=cbPalette))
ggplotly(ggplot(HG003_indel_voi_chrom_141516) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 14, 15, & 16") + scale_fill_manual(values=cbPalette))
########## TABLE
#(indel_variant_plot_df %>%
#  kbl() %>%
#  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))

SNP or INDEL

########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG003_snp_indel_variant_plot_df <- HG003_variant_df %>%
  filter(var_type == "SNP,INDEL") %>%
  select(VOI, var_type, var_size) %>%
  group_by(VOI, var_size) %>%
  summarise(count = n())  ## count how many of each size per poi
## `summarise()` has grouped output by 'VOI'. You can override using the `.groups` argument.
########## PLOT 
## organize all poi plots in a gallery view 
ggplotly(ggplot(HG003_snp_indel_variant_plot_df) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count") + scale_fill_manual(values=cbPalette))
(HG003_snp_indel_variant_plot_df %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))
VOI var_size count
12_38824344_38824345 6 1
15_79844900_79844901 3 1

Coverage

(cov_df %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))
giab_id ref benchmark bench_type region chrom start end chrom_start_end n_regions bp_cov gene_size cov
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 2 49482029 49582030 2_49532029_49532030 9 79461 100001 0.7946020
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 2 49511482 49611483 2_49561482_49561483 9 79625 100001 0.7962421
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 2 206191116 206291117 2_206241116_206241117 15 96235 100001 0.9623404
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 3 127809142 127909143 3_127859142_127859143 12 97554 100001 0.9755303
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 3 162462133 162562134 3_162512133_162512134 8 20624 100001 0.2062379
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 4 103714056 103814057 4_103764056_103764057 24 97890 100001 0.9788902
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 5 117818224 117918225 5_117868224_117868225 11 88364 100001 0.8836312
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 6 77387124 77487125 6_77437124_77437125 3 64357 100001 0.6435636
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 6 156031224 156131225 6_156081224_156081225 16 96115 100001 0.9611404
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 7 16121773 16221774 7_16171773_16171774 13 94218 100001 0.9421706
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 7 142774835 142874836 7_142824835_142824836 8 32058 100001 0.3205768
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 9 10113173 10213174 9_10163173_10163174 17 97277 100001 0.9727603
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 10 66249841 66349842 10_66299841_66299842 7 98570 100001 0.9856901
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 10 81645836 81745837 10_81695836_81695837 34 88776 100001 0.8877511
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 10 82474373 82574374 10_82524373_82524374 14 98663 100001 0.9866201
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 12 38774344 38874345 12_38824344_38824345 15 86144 100001 0.8614314
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 12 53449604 53549605 12_53499604_53499605 32 93009 100001 0.9300807
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 13 42173740 42273741 13_42223740_42223741 11 98893 100001 0.9889201
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 13 90104843 90204844 13_90154843_90154844 26 96980 100001 0.9697903
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 14 105099647 105199648 14_105149647_105149648 19 93893 100001 0.9389206
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 15 79794900 79894901 15_79844900_79844901 19 94816 100001 0.9481505
50000 HG002 GRCh37 v0.6.2 SVs-Tier1-noVDJorXorY 16 82741081 82841082 16_82791081_82791082 15 97592 100001 0.9759102
50000 HG002 GRCh37 v4 smallvar 2 49482029 49582030 2_49532029_49532030 9 79461 100001 0.7946020
50000 HG002 GRCh37 v4 smallvar 2 49511482 49611483 2_49561482_49561483 9 79625 100001 0.7962421
50000 HG002 GRCh37 v4 smallvar 2 206191116 206291117 2_206241116_206241117 15 96235 100001 0.9623404
50000 HG002 GRCh37 v4 smallvar 3 127809142 127909143 3_127859142_127859143 12 97554 100001 0.9755303
50000 HG002 GRCh37 v4 smallvar 3 162462133 162562134 3_162512133_162512134 8 20624 100001 0.2062379
50000 HG002 GRCh37 v4 smallvar 4 103714056 103814057 4_103764056_103764057 24 97890 100001 0.9788902
50000 HG002 GRCh37 v4 smallvar 5 117818224 117918225 5_117868224_117868225 11 88364 100001 0.8836312
50000 HG002 GRCh37 v4 smallvar 6 77387124 77487125 6_77437124_77437125 3 64357 100001 0.6435636
50000 HG002 GRCh37 v4 smallvar 6 156031224 156131225 6_156081224_156081225 16 96115 100001 0.9611404
50000 HG002 GRCh37 v4 smallvar 7 16121773 16221774 7_16171773_16171774 13 94218 100001 0.9421706
50000 HG002 GRCh37 v4 smallvar 7 142774835 142874836 7_142824835_142824836 8 32058 100001 0.3205768
50000 HG002 GRCh37 v4 smallvar 9 10113173 10213174 9_10163173_10163174 17 97277 100001 0.9727603
50000 HG002 GRCh37 v4 smallvar 10 66249841 66349842 10_66299841_66299842 7 98570 100001 0.9856901
50000 HG002 GRCh37 v4 smallvar 10 81645836 81745837 10_81695836_81695837 34 88776 100001 0.8877511
50000 HG002 GRCh37 v4 smallvar 10 82474373 82574374 10_82524373_82524374 14 98663 100001 0.9866201
50000 HG002 GRCh37 v4 smallvar 12 38774344 38874345 12_38824344_38824345 15 86144 100001 0.8614314
50000 HG002 GRCh37 v4 smallvar 12 53449604 53549605 12_53499604_53499605 32 93009 100001 0.9300807
50000 HG002 GRCh37 v4 smallvar 13 42173740 42273741 13_42223740_42223741 11 98893 100001 0.9889201
50000 HG002 GRCh37 v4 smallvar 13 90104843 90204844 13_90154843_90154844 26 96980 100001 0.9697903
50000 HG002 GRCh37 v4 smallvar 14 105099647 105199648 14_105149647_105149648 19 93893 100001 0.9389206
50000 HG002 GRCh37 v4 smallvar 15 79794900 79894901 15_79844900_79844901 19 94816 100001 0.9481505
50000 HG002 GRCh37 v4 smallvar 16 82741081 82841082 16_82791081_82791082 15 97592 100001 0.9759102
50000 HG003 GRCh37 v4 smallvar 2 49482029 49582030 2_49532029_49532030 10 79445 100001 0.7944421
50000 HG003 GRCh37 v4 smallvar 2 49511482 49611483 2_49561482_49561483 11 79501 100001 0.7950020
50000 HG003 GRCh37 v4 smallvar 2 206191116 206291117 2_206241116_206241117 19 95588 100001 0.9558704
50000 HG003 GRCh37 v4 smallvar 3 127809142 127909143 3_127859142_127859143 17 96973 100001 0.9697203
50000 HG003 GRCh37 v4 smallvar 3 162462133 162562134 3_162512133_162512134 7 20663 100001 0.2066279
50000 HG003 GRCh37 v4 smallvar 4 103714056 103814057 4_103764056_103764057 21 98498 100001 0.9849702
50000 HG003 GRCh37 v4 smallvar 5 117818224 117918225 5_117868224_117868225 15 86373 100001 0.8637214
50000 HG003 GRCh37 v4 smallvar 6 77387124 77487125 6_77437124_77437125 4 64249 100001 0.6424836
50000 HG003 GRCh37 v4 smallvar 6 156031224 156131225 6_156081224_156081225 14 96472 100001 0.9647104
50000 HG003 GRCh37 v4 smallvar 7 16121773 16221774 7_16171773_16171774 14 94044 100001 0.9404306
50000 HG003 GRCh37 v4 smallvar 7 142774835 142874836 7_142824835_142824836 7 32353 100001 0.3235268
50000 HG003 GRCh37 v4 smallvar 9 10113173 10213174 9_10163173_10163174 24 92091 100001 0.9209008
50000 HG003 GRCh37 v4 smallvar 10 66249841 66349842 10_66299841_66299842 6 98730 100001 0.9872901
50000 HG003 GRCh37 v4 smallvar 10 81645836 81745837 10_81695836_81695837 43 89273 100001 0.8927211
50000 HG003 GRCh37 v4 smallvar 10 82474373 82574374 10_82524373_82524374 21 98563 100001 0.9856201
50000 HG003 GRCh37 v4 smallvar 12 38774344 38874345 12_38824344_38824345 22 97372 100001 0.9737102
50000 HG003 GRCh37 v4 smallvar 12 53449604 53549605 12_53499604_53499605 35 93468 100001 0.9346706
50000 HG003 GRCh37 v4 smallvar 13 42173740 42273741 13_42223740_42223741 10 99195 100001 0.9919401
50000 HG003 GRCh37 v4 smallvar 13 90104843 90204844 13_90154843_90154844 31 96018 100001 0.9601704
50000 HG003 GRCh37 v4 smallvar 14 105099647 105199648 14_105149647_105149648 26 91743 100001 0.9174208
50000 HG003 GRCh37 v4 smallvar 15 79794900 79894901 15_79844900_79844901 20 94957 100001 0.9495605
50000 HG003 GRCh37 v4 smallvar 16 82741081 82841082 16_82791081_82791082 12 97945 100001 0.9794402

Analysis II: Inspection of Final POIs

Samantha went through the positions of interest and determined which 8 were best suited for the study. Those positions are (ref: GRCh37):

chrom start position
2 49532029
3 127859142
3 162512133
5 117868223
6 77437124
7 142824835
9 10163173
12 38824344

Pull all INDELs surrounding these poi’s.

HG002

HG002_pois <- c("2_49532029_49532030", "3_127859142_127859143", "3_162512133_162512134", 
                "5_117868224_117868225", "6_77437124_77437125", "7_142824835_142824836",
                "9_10163173_10163174", "12_38824344_38824345") ## list of final pois

HG002_final_poi_df <- HG002_variant_df %>% 
  filter(VOI %in% HG002_pois) %>% ## keep only the final pois
  filter(!var_type == "SNP") ## keep INDEL and SNP,INDEL types only

HG002_final_poi_plot_df <- HG002_final_poi_df %>%
  select(VOI, var_type, var_size) %>%
  group_by(VOI, var_size) %>%
  summarise(count = n())  ## count how many of each size per poi
## `summarise()` has grouped output by 'VOI'. You can override using the `.groups` argument.
########## PLOT 
## organize all poi plots in a gallery view 
ggplotly(ggplot(HG002_final_poi_plot_df) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "HG002: INDEL Variants per POI: Final POIs") + scale_fill_manual(values=cbPalette))  
## write out final poi df for giab_id
write.table(HG002_final_poi_df, file = here("results/analysis/HG002_final_poi_df.tsv"), sep = "\t", row.names = FALSE)

HG003

HG003_pois <- c("2_49532029_49532030", "3_127859142_127859143", "3_162512133_162512134", 
                "5_117868224_117868225", "6_77437124_77437125", "7_142824835_142824836",
                "9_10163173_10163174", "12_38824344_38824345") ## list of final pois

HG003_final_poi_df <- HG003_variant_df %>% 
  filter(VOI %in% HG003_pois) %>% ## keep only the final pois
  filter(!var_type == "SNP") ## keep INDEL and SNP,INDEL types only

HG003_final_poi_plot_df <- HG003_final_poi_df %>%
  select(VOI, var_type, var_size) %>%
  group_by(VOI, var_size) %>%
  summarise(count = n())  ## count how many of each size per poi
## `summarise()` has grouped output by 'VOI'. You can override using the `.groups` argument.
########## PLOT 
## organize all poi plots in a gallery view 
ggplotly(ggplot(HG003_final_poi_plot_df) +
    geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
      scale_x_continuous(breaks = pretty_breaks(n=10)) +
      scale_y_continuous(breaks = pretty_breaks(n=10)) +
      facet_wrap(~VOI) + 
      theme_bw() + 
      theme(axis.text.x = element_text(angle = 90)) + 
      labs(x = "Variant Variant Size", y = "Count", title = "HG003: INDEL Variants per POI: Final POIs") + scale_fill_manual(values=cbPalette))  
## write out final poi df for giab_id
write.table(HG003_final_poi_df, file = here("results/analysis/HG003_final_poi_df.tsv"), sep = "\t", row.names = FALSE)